User overview analysis¶
In this analysis I examine the most essential component of the business: its customers. I work through the given dataset in a structured way by splitting the work into phases. In each phase I apply several exploratory techniques and compare the results. This analysis is valuable because it points to the next logical steps, questions, and areas of research ahead.
In the user overview analysis we are going to go through the following steps:
Phase 1: Aggregate user behavior
- Loading the dataset
- Data cleaning
- Data transformation
Phase 2: User analysis
- Identify top 10 handsets
- Identify top 3 handset manufacturers
- Identify top 5 handsets per top 3 manufacturers
- Provide interpretation on the above subpoints
Phase 3: Application behavior analysis
- Aggregate User behavior data
- Visualize user behavior
Phase 4: Exploratory data analysis (EDA)
- Non-graphical univariate analysis
- Graphical univariate analysis
- Bivariate analysis
- Correlation analysis
- Dimensionality reduction (PCA)
Done with Task 1 (User overview analysis).
Phase 1¶
In this stage the code tries to understand the data through common techniques such as looking for missing values and outliers, and in general works through the data cleaning and data transformation stages.
pip install psycopg2-binary sqlalchemy
Requirement already satisfied: psycopg2-binary in c:\users\beab\desktop\kifiya aim\investor-feasibility-analysis\venv\lib\site-packages (2.9.9) Requirement already satisfied: sqlalchemy in c:\users\beab\desktop\kifiya aim\investor-feasibility-analysis\venv\lib\site-packages (2.0.34) Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\beab\desktop\kifiya aim\investor-feasibility-analysis\venv\lib\site-packages (from sqlalchemy) (4.12.2) Requirement already satisfied: greenlet!=0.4.17 in c:\users\beab\desktop\kifiya aim\investor-feasibility-analysis\venv\lib\site-packages (from sqlalchemy) (3.0.3) Note: you may need to restart the kernel to use updated packages.
import sys
import os
sys.path.append(os.path.abspath('../scripts'))
from load_data import load_data_from_postgres, load_data_using_sqlalchemy

# Define the SQL query
query = "SELECT * FROM xdr_data;"  # Replace with your actual table name

# Load data from PostgreSQL
df = load_data_from_postgres(query)

# Confirm that the data loaded
if df is not None:
    print("Successfully loaded the data")
else:
    print("Failed to load data.")
c:\Users\Beab\Desktop\Kifiya AIM\Investor-feasibility-analysis\scripts\load_data.py:38: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy. df = pd.read_sql_query(query, connection)
Successfully loaded the data
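The UserWarning above appears because the loader passes a raw psycopg2 connection to pandas. A minimal sketch of the SQLAlchemy-engine approach that pandas recommends, shown here with an in-memory SQLite engine and a tiny made-up stand-in table so the snippet is self-contained; in the real pipeline the engine URL would point at the PostgreSQL database:

```python
import pandas as pd
from sqlalchemy import create_engine

# In the real pipeline this would be a PostgreSQL URL, e.g.
# create_engine("postgresql+psycopg2://user:password@localhost:5432/dbname")
engine = create_engine("sqlite://")  # in-memory stand-in for the demo

# Hypothetical two-row stand-in for the xdr_data table
pd.DataFrame({"Bearer Id": [1.3e19, 7.3e18],
              "Dur. (ms)": [86399.0, 57442.0]}).to_sql("xdr_data", engine, index=False)

# Reading through an engine avoids the "pandas only supports SQLAlchemy
# connectable" warning raised for raw DBAPI2 connections
df_demo = pd.read_sql_query("SELECT * FROM xdr_data;", engine)
print(df_demo.shape)
```

Passing the engine (or an engine URL string) to `pd.read_sql_query` is the supported path; only the connection object changes, not the query.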
2. Data cleaning¶
Identifying outliers with the IQR rule and replacing them with the column median. Note that outliers are counted per numeric column, so the reported total can exceed the number of rows, and no rows are dropped.
import pandas as pd

# Display the number of rows before outlier handling
print(f"Number of rows before outlier removal: {df.shape[0]}")

# Function to identify and replace outliers using the IQR method
def handle_outliers_iqr(df):
    df_cleaned = df.copy()  # Create a copy of the DataFrame for modifications
    outlier_info = {}  # Dictionary to store information about outliers
    for column in df_cleaned.select_dtypes(include=['float64', 'int64']).columns:
        # Calculate Q1 (25th percentile) and Q3 (75th percentile)
        Q1 = df_cleaned[column].quantile(0.25)
        Q3 = df_cleaned[column].quantile(0.75)
        IQR = Q3 - Q1

        # Define outlier bounds
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Identify outliers
        outliers = df_cleaned[(df_cleaned[column] < lower_bound) | (df_cleaned[column] > upper_bound)]

        # Store the count of outliers
        outlier_info[column] = outliers.shape[0]

        # Replace outliers with the median of the column
        median_value = df_cleaned[column].median()
        df_cleaned.loc[(df_cleaned[column] < lower_bound) | (df_cleaned[column] > upper_bound), column] = median_value
    return df_cleaned, outlier_info

# Apply the function to handle outliers
df_cleaned, outlier_info = handle_outliers_iqr(df)

# Display the number of rows after outlier handling (rows are kept; values are replaced)
print(f"Number of rows after outlier removal: {df_cleaned.shape[0]}")

# Display information about outliers (counted per column, so the total can exceed the row count)
total_outliers_removed = sum(outlier_info.values())
print(f"\nTotal number of outliers removed: {total_outliers_removed}")

# Store the cleaned DataFrame in a new DataFrame
df_outliers_removed = df_cleaned.copy()

# Optionally, display the cleaned DataFrame's first few rows
print("\nFirst few rows of the DataFrame with outliers removed:")
print(df_outliers_removed.head())
Number of rows before outlier removal: 150001
Number of rows after outlier removal: 150001
Total number of outliers removed: 451082
First few rows of the DataFrame with outliers removed:
Bearer Id Start Start ms End End ms \
0 1.311448e+19 4/4/2019 12:01 770.0 4/25/2019 14:35 662.0
1 1.311448e+19 4/9/2019 13:04 235.0 4/25/2019 8:15 606.0
2 1.311448e+19 4/9/2019 17:42 1.0 4/25/2019 11:58 652.0
3 1.311448e+19 4/10/2019 0:31 486.0 4/25/2019 7:36 171.0
4 1.311448e+19 4/12/2019 20:10 565.0 4/25/2019 10:40 954.0
Dur. (ms) IMSI MSISDN/Number IMEI \
0 86399.0 2.082014e+14 3.366496e+10 3.552121e+13
1 86399.0 2.082019e+14 3.368185e+10 3.579401e+13
2 86399.0 2.082015e+14 3.366371e+10 3.528151e+13
3 86399.0 2.082014e+14 3.366371e+10 3.535661e+13
4 86399.0 2.082014e+14 3.369980e+10 3.540701e+13
Last Location Name ... Youtube DL (Bytes) Youtube UL (Bytes) \
0 9.16456699548519E+015 ... 15854611.0 2501332.0
1 L77566A ... 20247395.0 19111729.0
2 D42335A ... 19725661.0 14699576.0
3 T21824A ... 21388122.0 15146643.0
4 D88865A ... 15259380.0 18962873.0
Netflix DL (Bytes) Netflix UL (Bytes) Gaming DL (Bytes) \
0 8198936.0 9656251.0 278082303.0
1 18338413.0 17227132.0 608750074.0
2 17587794.0 6163408.0 229584621.0
3 13994646.0 1097942.0 799538153.0
4 17124581.0 415218.0 527707248.0
Gaming UL (Bytes) Other DL (Bytes) Other UL (Bytes) Total UL (Bytes) \
0 14344150.0 171744450.0 8814393.0 36749741.0
1 1170709.0 526904238.0 15055145.0 53800391.0
2 395630.0 410692588.0 4215763.0 27883638.0
3 10849722.0 749039933.0 12797283.0 43324218.0
4 3529801.0 550709500.0 13910322.0 38542814.0
Total DL (Bytes)
0 308879636.0
1 653384965.0
2 279807335.0
3 846028530.0
4 569138589.0
[5 rows x 55 columns]
Looking for missing values and replacing them with statistical imputation: the median for numeric columns and the mode for categorical columns.
import pandas as pd

# Function to handle missing values using statistical imputation
def handle_missing_values(df):
    df_filled = df.copy()  # Create a copy of the DataFrame for modifications
    for column in df_filled.columns:
        # If the column is numeric (float or int), replace missing values with the median
        if df_filled[column].dtype in ['float64', 'int64']:
            median_value = df_filled[column].median()
            df_filled[column] = df_filled[column].fillna(median_value)
        # If the column is categorical (object), replace missing values with the mode (most frequent value)
        elif df_filled[column].dtype == 'object':
            mode_value = df_filled[column].mode()[0]
            df_filled[column] = df_filled[column].fillna(mode_value)
    return df_filled

# Apply the function to handle missing values
df_filled = handle_missing_values(df_outliers_removed)

# Display information about missing values after handling
print("Missing values after imputation:")
print(df_filled.isnull().sum())

# Store the results in a new DataFrame
df_missing_values_handled = df_filled.copy()

# Optionally, display the first few rows of the new DataFrame
print("\nFirst few rows of the DataFrame with missing values handled:")
print(df_missing_values_handled.head())
Missing values after imputation:
Bearer Id 0
Start 0
Start ms 0
End 0
End ms 0
Dur. (ms) 0
IMSI 0
MSISDN/Number 0
IMEI 0
Last Location Name 0
Avg RTT DL (ms) 0
Avg RTT UL (ms) 0
Avg Bearer TP DL (kbps) 0
Avg Bearer TP UL (kbps) 0
TCP DL Retrans. Vol (Bytes) 0
TCP UL Retrans. Vol (Bytes) 0
DL TP < 50 Kbps (%) 0
50 Kbps < DL TP < 250 Kbps (%) 0
250 Kbps < DL TP < 1 Mbps (%) 0
DL TP > 1 Mbps (%) 0
UL TP < 10 Kbps (%) 0
10 Kbps < UL TP < 50 Kbps (%) 0
50 Kbps < UL TP < 300 Kbps (%) 0
UL TP > 300 Kbps (%) 0
HTTP DL (Bytes) 0
HTTP UL (Bytes) 0
Activity Duration DL (ms) 0
Activity Duration UL (ms) 0
Dur. (ms).1 0
Handset Manufacturer 0
Handset Type 0
Nb of sec with 125000B < Vol DL 0
Nb of sec with 1250B < Vol UL < 6250B 0
Nb of sec with 31250B < Vol DL < 125000B 0
Nb of sec with 37500B < Vol UL 0
Nb of sec with 6250B < Vol DL < 31250B 0
Nb of sec with 6250B < Vol UL < 37500B 0
Nb of sec with Vol DL < 6250B 0
Nb of sec with Vol UL < 1250B 0
Social Media DL (Bytes) 0
Social Media UL (Bytes) 0
Google DL (Bytes) 0
Google UL (Bytes) 0
Email DL (Bytes) 0
Email UL (Bytes) 0
Youtube DL (Bytes) 0
Youtube UL (Bytes) 0
Netflix DL (Bytes) 0
Netflix UL (Bytes) 0
Gaming DL (Bytes) 0
Gaming UL (Bytes) 0
Other DL (Bytes) 0
Other UL (Bytes) 0
Total UL (Bytes) 0
Total DL (Bytes) 0
dtype: int64
First few rows of the DataFrame with missing values handled:
Bearer Id Start Start ms End End ms \
0 1.311448e+19 4/4/2019 12:01 770.0 4/25/2019 14:35 662.0
1 1.311448e+19 4/9/2019 13:04 235.0 4/25/2019 8:15 606.0
2 1.311448e+19 4/9/2019 17:42 1.0 4/25/2019 11:58 652.0
3 1.311448e+19 4/10/2019 0:31 486.0 4/25/2019 7:36 171.0
4 1.311448e+19 4/12/2019 20:10 565.0 4/25/2019 10:40 954.0
Dur. (ms) IMSI MSISDN/Number IMEI \
0 86399.0 2.082014e+14 3.366496e+10 3.552121e+13
1 86399.0 2.082019e+14 3.368185e+10 3.579401e+13
2 86399.0 2.082015e+14 3.366371e+10 3.528151e+13
3 86399.0 2.082014e+14 3.366371e+10 3.535661e+13
4 86399.0 2.082014e+14 3.369980e+10 3.540701e+13
Last Location Name ... Youtube DL (Bytes) Youtube UL (Bytes) \
0 9.16456699548519E+015 ... 15854611.0 2501332.0
1 L77566A ... 20247395.0 19111729.0
2 D42335A ... 19725661.0 14699576.0
3 T21824A ... 21388122.0 15146643.0
4 D88865A ... 15259380.0 18962873.0
Netflix DL (Bytes) Netflix UL (Bytes) Gaming DL (Bytes) \
0 8198936.0 9656251.0 278082303.0
1 18338413.0 17227132.0 608750074.0
2 17587794.0 6163408.0 229584621.0
3 13994646.0 1097942.0 799538153.0
4 17124581.0 415218.0 527707248.0
Gaming UL (Bytes) Other DL (Bytes) Other UL (Bytes) Total UL (Bytes) \
0 14344150.0 171744450.0 8814393.0 36749741.0
1 1170709.0 526904238.0 15055145.0 53800391.0
2 395630.0 410692588.0 4215763.0 27883638.0
3 10849722.0 749039933.0 12797283.0 43324218.0
4 3529801.0 550709500.0 13910322.0 38542814.0
Total DL (Bytes)
0 308879636.0
1 653384965.0
2 279807335.0
3 846028530.0
4 569138589.0
[5 rows x 55 columns]
Lastly, ensuring that there are no duplicate rows and converting columns to the correct data types.
import pandas as pd

# Ensure consistency in the dataset
def ensure_consistency(df):
    # Remove duplicate entries
    df_cleaned = df.drop_duplicates().reset_index(drop=True)

    # Convert data types to appropriate types
    for column in df_cleaned.columns:
        # If the column contains numeric data but is stored as an object, convert it to numeric
        if df_cleaned[column].dtype == 'object':
            try:
                df_cleaned[column] = pd.to_numeric(df_cleaned[column])
            except ValueError:
                # If conversion fails, it's likely a categorical column, so leave it as is
                pass

        # Convert datetime-like strings to actual datetime objects with a specific format if possible.
        # Note: the formats tried below do not match timestamps like "4/4/2019 12:01",
        # which is why the Start and End columns remain object dtype in the output.
        if df_cleaned[column].dtype == 'object':
            sample_value = df_cleaned[column].dropna().iloc[0]  # Take a sample value from the column
            try:
                # Check if the sample value looks like a date and, if so, convert the entire column
                if isinstance(pd.to_datetime(sample_value, format='%Y-%m-%d', errors='raise'), pd.Timestamp):
                    df_cleaned[column] = pd.to_datetime(df_cleaned[column], format='%Y-%m-%d', errors='coerce')
                elif isinstance(pd.to_datetime(sample_value, format='%d/%m/%Y', errors='raise'), pd.Timestamp):
                    df_cleaned[column] = pd.to_datetime(df_cleaned[column], format='%d/%m/%Y', errors='coerce')
                # Add other date formats as needed
            except (ValueError, TypeError):
                # If conversion fails, leave the column as is
                pass
    return df_cleaned

# Apply the function to the df_missing_values_handled DataFrame
df_cleaned = ensure_consistency(df_missing_values_handled)

# Optionally, display information about the cleaned DataFrame
print("DataFrame info after ensuring consistency:")
print(df_cleaned.info())

# Optionally, display the first few rows of the cleaned DataFrame
print("\nFirst few rows of the DataFrame after ensuring consistency:")
print(df_cleaned.head())
DataFrame info after ensuring consistency:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bearer Id 150001 non-null float64
1 Start 150001 non-null object
2 Start ms 150001 non-null float64
3 End 150001 non-null object
4 End ms 150001 non-null float64
5 Dur. (ms) 150001 non-null float64
6 IMSI 150001 non-null float64
7 MSISDN/Number 150001 non-null float64
8 IMEI 150001 non-null float64
9 Last Location Name 150001 non-null object
10 Avg RTT DL (ms) 150001 non-null float64
11 Avg RTT UL (ms) 150001 non-null float64
12 Avg Bearer TP DL (kbps) 150001 non-null float64
13 Avg Bearer TP UL (kbps) 150001 non-null float64
14 TCP DL Retrans. Vol (Bytes) 150001 non-null float64
15 TCP UL Retrans. Vol (Bytes) 150001 non-null float64
16 DL TP < 50 Kbps (%) 150001 non-null float64
17 50 Kbps < DL TP < 250 Kbps (%) 150001 non-null float64
18 250 Kbps < DL TP < 1 Mbps (%) 150001 non-null float64
19 DL TP > 1 Mbps (%) 150001 non-null float64
20 UL TP < 10 Kbps (%) 150001 non-null float64
21 10 Kbps < UL TP < 50 Kbps (%) 150001 non-null float64
22 50 Kbps < UL TP < 300 Kbps (%) 150001 non-null float64
23 UL TP > 300 Kbps (%) 150001 non-null float64
24 HTTP DL (Bytes) 150001 non-null float64
25 HTTP UL (Bytes) 150001 non-null float64
26 Activity Duration DL (ms) 150001 non-null float64
27 Activity Duration UL (ms) 150001 non-null float64
28 Dur. (ms).1 150001 non-null float64
29 Handset Manufacturer 150001 non-null object
30 Handset Type 150001 non-null object
31 Nb of sec with 125000B < Vol DL 150001 non-null float64
32 Nb of sec with 1250B < Vol UL < 6250B 150001 non-null float64
33 Nb of sec with 31250B < Vol DL < 125000B 150001 non-null float64
34 Nb of sec with 37500B < Vol UL 150001 non-null float64
35 Nb of sec with 6250B < Vol DL < 31250B 150001 non-null float64
36 Nb of sec with 6250B < Vol UL < 37500B 150001 non-null float64
37 Nb of sec with Vol DL < 6250B 150001 non-null float64
38 Nb of sec with Vol UL < 1250B 150001 non-null float64
39 Social Media DL (Bytes) 150001 non-null float64
40 Social Media UL (Bytes) 150001 non-null float64
41 Google DL (Bytes) 150001 non-null float64
42 Google UL (Bytes) 150001 non-null float64
43 Email DL (Bytes) 150001 non-null float64
44 Email UL (Bytes) 150001 non-null float64
45 Youtube DL (Bytes) 150001 non-null float64
46 Youtube UL (Bytes) 150001 non-null float64
47 Netflix DL (Bytes) 150001 non-null float64
48 Netflix UL (Bytes) 150001 non-null float64
49 Gaming DL (Bytes) 150001 non-null float64
50 Gaming UL (Bytes) 150001 non-null float64
51 Other DL (Bytes) 150001 non-null float64
52 Other UL (Bytes) 150001 non-null float64
53 Total UL (Bytes) 150001 non-null float64
54 Total DL (Bytes) 150001 non-null float64
dtypes: float64(50), object(5)
memory usage: 62.9+ MB
None
First few rows of the DataFrame after ensuring consistency:
Bearer Id Start Start ms End End ms \
0 1.311448e+19 4/4/2019 12:01 770.0 4/25/2019 14:35 662.0
1 1.311448e+19 4/9/2019 13:04 235.0 4/25/2019 8:15 606.0
2 1.311448e+19 4/9/2019 17:42 1.0 4/25/2019 11:58 652.0
3 1.311448e+19 4/10/2019 0:31 486.0 4/25/2019 7:36 171.0
4 1.311448e+19 4/12/2019 20:10 565.0 4/25/2019 10:40 954.0
Dur. (ms) IMSI MSISDN/Number IMEI \
0 86399.0 2.082014e+14 3.366496e+10 3.552121e+13
1 86399.0 2.082019e+14 3.368185e+10 3.579401e+13
2 86399.0 2.082015e+14 3.366371e+10 3.528151e+13
3 86399.0 2.082014e+14 3.366371e+10 3.535661e+13
4 86399.0 2.082014e+14 3.369980e+10 3.540701e+13
Last Location Name ... Youtube DL (Bytes) Youtube UL (Bytes) \
0 9.16456699548519E+015 ... 15854611.0 2501332.0
1 L77566A ... 20247395.0 19111729.0
2 D42335A ... 19725661.0 14699576.0
3 T21824A ... 21388122.0 15146643.0
4 D88865A ... 15259380.0 18962873.0
Netflix DL (Bytes) Netflix UL (Bytes) Gaming DL (Bytes) \
0 8198936.0 9656251.0 278082303.0
1 18338413.0 17227132.0 608750074.0
2 17587794.0 6163408.0 229584621.0
3 13994646.0 1097942.0 799538153.0
4 17124581.0 415218.0 527707248.0
Gaming UL (Bytes) Other DL (Bytes) Other UL (Bytes) Total UL (Bytes) \
0 14344150.0 171744450.0 8814393.0 36749741.0
1 1170709.0 526904238.0 15055145.0 53800391.0
2 395630.0 410692588.0 4215763.0 27883638.0
3 10849722.0 749039933.0 12797283.0 43324218.0
4 3529801.0 550709500.0 13910322.0 38542814.0
Total DL (Bytes)
0 308879636.0
1 653384965.0
2 279807335.0
3 846028530.0
4 569138589.0
[5 rows x 55 columns]
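As the info output shows, Start and End are still object dtype: the formats tried above ('%Y-%m-%d', '%d/%m/%Y') do not match timestamps like "4/4/2019 12:01". A small sketch, assuming month/day/year ordering, of a format string that would parse them:

```python
import pandas as pd

# Two sample values copied from the Start/End columns above
s = pd.Series(["4/4/2019 12:01", "4/25/2019 14:35"])

# '%m/%d/%Y %H:%M' matches the month/day/year-plus-time layout
parsed = pd.to_datetime(s, format="%m/%d/%Y %H:%M")
print(parsed.dtype)
```

Adding this format to the candidates in `ensure_consistency` would let the two timestamp columns become proper datetimes.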
Final overview after the data has been cleaned.
import pandas as pd

def display_dataset_characteristics(df):
    print("### Dataset Characteristics ###\n")

    # Display the shape of the dataset
    print(f"Shape of the DataFrame: {df.shape}")

    # Display the column data types
    print("\nData Types of Each Column:")
    print(df.dtypes)

    # Display summary statistics for numeric columns
    print("\nSummary Statistics for Numeric Columns:")
    print(df.describe())

    # Display the count of missing values per column
    print("\nCount of Missing Values per Column:")
    print(df.isnull().sum())

    # Display the count of unique values per column
    print("\nCount of Unique Values per Column:")
    print(df.nunique())

    # Display the first few rows of the cleaned dataset
    print("\nFirst Few Rows of the DataFrame:")
    print(df.head())

# Apply the function to the df_cleaned DataFrame
display_dataset_characteristics(df_cleaned)
### Dataset Characteristics ###
Shape of the DataFrame: (150001, 55)
Data Types of Each Column:
Bearer Id float64
Start object
Start ms float64
End object
End ms float64
Dur. (ms) float64
IMSI float64
MSISDN/Number float64
IMEI float64
Last Location Name object
Avg RTT DL (ms) float64
Avg RTT UL (ms) float64
Avg Bearer TP DL (kbps) float64
Avg Bearer TP UL (kbps) float64
TCP DL Retrans. Vol (Bytes) float64
TCP UL Retrans. Vol (Bytes) float64
DL TP < 50 Kbps (%) float64
50 Kbps < DL TP < 250 Kbps (%) float64
250 Kbps < DL TP < 1 Mbps (%) float64
DL TP > 1 Mbps (%) float64
UL TP < 10 Kbps (%) float64
10 Kbps < UL TP < 50 Kbps (%) float64
50 Kbps < UL TP < 300 Kbps (%) float64
UL TP > 300 Kbps (%) float64
HTTP DL (Bytes) float64
HTTP UL (Bytes) float64
Activity Duration DL (ms) float64
Activity Duration UL (ms) float64
Dur. (ms).1 float64
Handset Manufacturer object
Handset Type object
Nb of sec with 125000B < Vol DL float64
Nb of sec with 1250B < Vol UL < 6250B float64
Nb of sec with 31250B < Vol DL < 125000B float64
Nb of sec with 37500B < Vol UL float64
Nb of sec with 6250B < Vol DL < 31250B float64
Nb of sec with 6250B < Vol UL < 37500B float64
Nb of sec with Vol DL < 6250B float64
Nb of sec with Vol UL < 1250B float64
Social Media DL (Bytes) float64
Social Media UL (Bytes) float64
Google DL (Bytes) float64
Google UL (Bytes) float64
Email DL (Bytes) float64
Email UL (Bytes) float64
Youtube DL (Bytes) float64
Youtube UL (Bytes) float64
Netflix DL (Bytes) float64
Netflix UL (Bytes) float64
Gaming DL (Bytes) float64
Gaming UL (Bytes) float64
Other DL (Bytes) float64
Other UL (Bytes) float64
Total UL (Bytes) float64
Total DL (Bytes) float64
dtype: object
Summary Statistics for Numeric Columns:
Bearer Id Start ms End ms Dur. (ms) \
count 1.500010e+05 150001.000000 150001.000000 150001.000000
mean 1.012045e+19 499.188199 498.800888 91980.342251
std 2.892438e+18 288.610872 288.096693 49294.358994
min 6.917538e+18 0.000000 0.000000 7142.000000
25% 7.349883e+18 250.000000 251.000000 57442.000000
50% 7.349883e+18 499.000000 500.000000 86399.000000
75% 1.304243e+19 749.000000 750.000000 116737.000000
max 1.318654e+19 999.000000 999.000000 244900.000000
IMSI MSISDN/Number IMEI Avg RTT DL (ms) \
count 1.500010e+05 1.500010e+05 1.500010e+05 150001.000000
mean 2.082015e+14 3.365881e+10 4.842592e+13 46.622169
std 3.299100e+08 2.065503e+07 2.238739e+13 19.304219
min 2.082009e+14 3.360301e+10 4.400152e+11 0.000000
25% 2.082014e+14 3.365222e+10 3.546261e+13 35.000000
50% 2.082015e+14 3.366371e+10 3.572201e+13 45.000000
75% 2.082018e+14 3.366649e+10 8.611970e+13 51.000000
max 2.082022e+14 3.370000e+10 9.900120e+13 127.000000
Avg RTT UL (ms) Avg Bearer TP DL (kbps) ... Youtube DL (Bytes) \
count 150001.000000 150001.000000 ... 1.500010e+05
mean 7.135372 6776.175892 ... 1.163407e+07
std 7.067340 12624.783394 ... 6.710569e+06
min 0.000000 0.000000 ... 5.300000e+01
25% 3.000000 43.000000 ... 5.833501e+06
50% 5.000000 63.000000 ... 1.161602e+07
75% 7.000000 6901.000000 ... 1.744852e+07
max 34.000000 49211.000000 ... 2.325910e+07
Youtube UL (Bytes) Netflix DL (Bytes) Netflix UL (Bytes) \
count 1.500010e+05 1.500010e+05 1.500010e+05
mean 1.100941e+07 1.162685e+07 1.100175e+07
std 6.345423e+06 6.725218e+06 6.359490e+06
min 1.050000e+02 4.200000e+01 3.500000e+01
25% 5.517965e+06 5.777156e+06 5.475981e+06
50% 1.101345e+07 1.164222e+07 1.099638e+07
75% 1.651556e+07 1.747048e+07 1.650727e+07
max 2.201196e+07 2.325919e+07 2.201196e+07
Gaming DL (Bytes) Gaming UL (Bytes) Other DL (Bytes) \
count 1.500010e+05 1.500010e+05 1.500010e+05
mean 4.220447e+08 8.288398e+06 4.211005e+08
std 2.439675e+08 4.782700e+06 2.432050e+08
min 2.516000e+03 5.900000e+01 3.290000e+03
25% 2.104733e+08 4.128476e+06 2.101869e+08
50% 4.234081e+08 8.291208e+06 4.218030e+08
75% 6.331742e+08 1.243162e+07 6.316918e+08
max 8.434419e+08 1.655879e+07 8.434425e+08
Other UL (Bytes) Total UL (Bytes) Total DL (Bytes)
count 1.500010e+05 1.500010e+05 1.500010e+05
mean 8.264799e+06 4.111975e+07 4.546434e+08
std 4.769004e+06 1.119712e+07 2.441421e+08
min 1.480000e+02 9.512288e+06 7.114041e+06
25% 4.145943e+06 3.324942e+07 2.431072e+08
50% 8.267071e+06 4.114331e+07 4.558411e+08
75% 1.238415e+07 4.900537e+07 6.657051e+08
max 1.655882e+07 7.271475e+07 9.029696e+08
[8 rows x 50 columns]
Count of Missing Values per Column:
Bearer Id 0
Start 0
Start ms 0
End 0
End ms 0
Dur. (ms) 0
IMSI 0
MSISDN/Number 0
IMEI 0
Last Location Name 0
Avg RTT DL (ms) 0
Avg RTT UL (ms) 0
Avg Bearer TP DL (kbps) 0
Avg Bearer TP UL (kbps) 0
TCP DL Retrans. Vol (Bytes) 0
TCP UL Retrans. Vol (Bytes) 0
DL TP < 50 Kbps (%) 0
50 Kbps < DL TP < 250 Kbps (%) 0
250 Kbps < DL TP < 1 Mbps (%) 0
DL TP > 1 Mbps (%) 0
UL TP < 10 Kbps (%) 0
10 Kbps < UL TP < 50 Kbps (%) 0
50 Kbps < UL TP < 300 Kbps (%) 0
UL TP > 300 Kbps (%) 0
HTTP DL (Bytes) 0
HTTP UL (Bytes) 0
Activity Duration DL (ms) 0
Activity Duration UL (ms) 0
Dur. (ms).1 0
Handset Manufacturer 0
Handset Type 0
Nb of sec with 125000B < Vol DL 0
Nb of sec with 1250B < Vol UL < 6250B 0
Nb of sec with 31250B < Vol DL < 125000B 0
Nb of sec with 37500B < Vol UL 0
Nb of sec with 6250B < Vol DL < 31250B 0
Nb of sec with 6250B < Vol UL < 37500B 0
Nb of sec with Vol DL < 6250B 0
Nb of sec with Vol UL < 1250B 0
Social Media DL (Bytes) 0
Social Media UL (Bytes) 0
Google DL (Bytes) 0
Google UL (Bytes) 0
Email DL (Bytes) 0
Email UL (Bytes) 0
Youtube DL (Bytes) 0
Youtube UL (Bytes) 0
Netflix DL (Bytes) 0
Netflix UL (Bytes) 0
Gaming DL (Bytes) 0
Gaming UL (Bytes) 0
Other DL (Bytes) 0
Other UL (Bytes) 0
Total UL (Bytes) 0
Total DL (Bytes) 0
dtype: int64
Count of Unique Values per Column:
Bearer Id 134709
Start 9997
Start ms 1000
End 6403
End ms 1000
Dur. (ms) 82601
IMSI 97425
MSISDN/Number 89699
IMEI 107270
Last Location Name 45547
Avg RTT DL (ms) 126
Avg RTT UL (ms) 35
Avg Bearer TP DL (kbps) 30140
Avg Bearer TP UL (kbps) 2729
TCP DL Retrans. Vol (Bytes) 45448
TCP UL Retrans. Vol (Bytes) 27763
DL TP < 50 Kbps (%) 23
50 Kbps < DL TP < 250 Kbps (%) 11
250 Kbps < DL TP < 1 Mbps (%) 3
DL TP > 1 Mbps (%) 1
UL TP < 10 Kbps (%) 3
10 Kbps < UL TP < 50 Kbps (%) 1
50 Kbps < UL TP < 300 Kbps (%) 1
UL TP > 300 Kbps (%) 1
HTTP DL (Bytes) 51155
HTTP UL (Bytes) 49530
Activity Duration DL (ms) 75936
Activity Duration UL (ms) 80333
Dur. (ms).1 115691
Handset Manufacturer 170
Handset Type 1396
Nb of sec with 125000B < Vol DL 1693
Nb of sec with 1250B < Vol UL < 6250B 492
Nb of sec with 31250B < Vol DL < 125000B 1848
Nb of sec with 37500B < Vol UL 84
Nb of sec with 6250B < Vol DL < 31250B 2653
Nb of sec with 6250B < Vol UL < 37500B 73
Nb of sec with Vol DL < 6250B 6324
Nb of sec with Vol UL < 1250B 5814
Social Media DL (Bytes) 146856
Social Media UL (Bytes) 59078
Google DL (Bytes) 149024
Google UL (Bytes) 147267
Email DL (Bytes) 146916
Email UL (Bytes) 138699
Youtube DL (Bytes) 149547
Youtube UL (Bytes) 149477
Netflix DL (Bytes) 149518
Netflix UL (Bytes) 149490
Gaming DL (Bytes) 149983
Gaming UL (Bytes) 149316
Other DL (Bytes) 149986
Other UL (Bytes) 149284
Total UL (Bytes) 149486
Total DL (Bytes) 149988
dtype: int64
First Few Rows of the DataFrame:
Bearer Id Start Start ms End End ms \
0 1.311448e+19 4/4/2019 12:01 770.0 4/25/2019 14:35 662.0
1 1.311448e+19 4/9/2019 13:04 235.0 4/25/2019 8:15 606.0
2 1.311448e+19 4/9/2019 17:42 1.0 4/25/2019 11:58 652.0
3 1.311448e+19 4/10/2019 0:31 486.0 4/25/2019 7:36 171.0
4 1.311448e+19 4/12/2019 20:10 565.0 4/25/2019 10:40 954.0
Dur. (ms) IMSI MSISDN/Number IMEI \
0 86399.0 2.082014e+14 3.366496e+10 3.552121e+13
1 86399.0 2.082019e+14 3.368185e+10 3.579401e+13
2 86399.0 2.082015e+14 3.366371e+10 3.528151e+13
3 86399.0 2.082014e+14 3.366371e+10 3.535661e+13
4 86399.0 2.082014e+14 3.369980e+10 3.540701e+13
Last Location Name ... Youtube DL (Bytes) Youtube UL (Bytes) \
0 9.16456699548519E+015 ... 15854611.0 2501332.0
1 L77566A ... 20247395.0 19111729.0
2 D42335A ... 19725661.0 14699576.0
3 T21824A ... 21388122.0 15146643.0
4 D88865A ... 15259380.0 18962873.0
Netflix DL (Bytes) Netflix UL (Bytes) Gaming DL (Bytes) \
0 8198936.0 9656251.0 278082303.0
1 18338413.0 17227132.0 608750074.0
2 17587794.0 6163408.0 229584621.0
3 13994646.0 1097942.0 799538153.0
4 17124581.0 415218.0 527707248.0
Gaming UL (Bytes) Other DL (Bytes) Other UL (Bytes) Total UL (Bytes) \
0 14344150.0 171744450.0 8814393.0 36749741.0
1 1170709.0 526904238.0 15055145.0 53800391.0
2 395630.0 410692588.0 4215763.0 27883638.0
3 10849722.0 749039933.0 12797283.0 43324218.0
4 3529801.0 550709500.0 13910322.0 38542814.0
Total DL (Bytes)
0 308879636.0
1 653384965.0
2 279807335.0
3 846028530.0
4 569138589.0
[5 rows x 55 columns]
Phase 2: User analysis¶
In this phase we are going to identify the top 10 handsets, the top 3 handset manufacturers, and the top 5 handsets per top 3 manufacturers, followed by a short interpretation.
import pandas as pd

# Phase 2: User Analysis

# 1. Identify Top 10 Handsets
def top_10_handsets(df):
    top_handsets = df['Handset Type'].value_counts().head(10)
    return top_handsets

# 2. Identify Top 3 Handset Manufacturers
def top_3_manufacturers(df):
    top_manufacturers = df['Handset Manufacturer'].value_counts().head(3)
    return top_manufacturers

# 3. Identify Top 5 Handsets per Top 3 Manufacturers
def top_5_handsets_per_manufacturer(df, top_manufacturers):
    top_5_handsets_per_manufacturer = {}
    for manufacturer in top_manufacturers.index:
        handsets = df[df['Handset Manufacturer'] == manufacturer]['Handset Type']
        top_handsets = handsets.value_counts().head(5)
        top_5_handsets_per_manufacturer[manufacturer] = top_handsets
    return top_5_handsets_per_manufacturer

# Execute analysis
top_10_handsets_df = top_10_handsets(df_cleaned)
top_3_manufacturers_df = top_3_manufacturers(df_cleaned)
top_5_handsets_per_manufacturer_df = top_5_handsets_per_manufacturer(df_cleaned, top_3_manufacturers_df)

# Display results
print("### Top 10 Handsets ###")
print(top_10_handsets_df)

print("\n### Top 3 Handset Manufacturers ###")
print(top_3_manufacturers_df)

print("\n### Top 5 Handsets per Top 3 Manufacturers ###")
for manufacturer, handsets in top_5_handsets_per_manufacturer_df.items():
    print(f"\n{manufacturer}:")
    print(handsets)
### Top 10 Handsets ###
Handset Type
Huawei B528S-23A                20324
Apple iPhone 6S (A1688)          9419
Apple iPhone 6 (A1586)           9023
undefined                        8987
Apple iPhone 7 (A1778)           6326
Apple iPhone Se (A1723)          5187
Apple iPhone 8 (A1905)           4993
Apple iPhone Xr (A2105)          4568
Samsung Galaxy S8 (Sm-G950F)     4520
Apple iPhone X (A1901)           3813
Name: count, dtype: int64

### Top 3 Handset Manufacturers ###
Handset Manufacturer
Apple      60137
Samsung    40839
Huawei     34423
Name: count, dtype: int64

### Top 5 Handsets per Top 3 Manufacturers ###

Apple:
Handset Type
Apple iPhone 6S (A1688)    9419
Apple iPhone 6 (A1586)     9023
Apple iPhone 7 (A1778)     6326
Apple iPhone Se (A1723)    5187
Apple iPhone 8 (A1905)     4993
Name: count, dtype: int64

Samsung:
Handset Type
Samsung Galaxy S8 (Sm-G950F)    4520
Samsung Galaxy A5 Sm-A520F      3724
Samsung Galaxy J5 (Sm-J530)     3696
Samsung Galaxy J3 (Sm-J330)     3484
Samsung Galaxy S7 (Sm-G930X)    3199
Name: count, dtype: int64

Huawei:
Handset Type
Huawei B528S-23A                  19752
Huawei E5180                       2079
Huawei P20 Lite Huawei Nova 3E     2021
Huawei P20                         1480
Huawei Y6 2018                      997
Name: count, dtype: int64
Interpretation of the above analysis of handset types and handset manufacturers¶
The analysis reveals that Apple, Samsung, and Huawei are the top three handset manufacturers, with Apple leading by a significant margin in terms of handset usage. Apple devices dominate the list of top 10 handsets, with the iPhone 6S and iPhone 6 the most frequently used, indicating strong brand loyalty and high usage rates across Apple's product line. Note also the 'undefined' entry in fourth place (8,987 records): a sizeable share of sessions have no recorded handset type, which is worth flagging as a data quality issue.
Samsung follows as the second most popular manufacturer, with the Samsung Galaxy S8 being the most used handset among Samsung models. Samsung's top 5 handsets include various Galaxy models, highlighting the popularity of their Galaxy series among users.
Huawei ranks third by total volume, yet the Huawei B528S-23A is the single most-used device in the entire dataset. Beyond that one model (an LTE home router rather than a phone), usage of Huawei's remaining handsets drops off sharply, so while Huawei is a significant player in the market, it does not match Apple or Samsung in breadth of handset popularity.
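The market split among the top three manufacturers can be quantified directly from the counts printed above; a quick sketch (shares computed among the top 3 only, using the reported figures):

```python
# Rough market-share estimate from the manufacturer counts reported above.
# Shares are relative to the top 3 manufacturers only, not the whole dataset.
counts = {"Apple": 60137, "Samsung": 40839, "Huawei": 34423}
total = sum(counts.values())
shares = {m: round(100 * n / total, 1) for m, n in counts.items()}
print(shares)  # Apple ~44.4%, Samsung ~30.2%, Huawei ~25.4%
```

Even within the top 3, Apple alone accounts for well over 40% of recorded handsets.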
Phase 3: Application Behavior Analysis¶
In this stage we are going to aggregate user behavior data and visualize user behavior.
Aggregation of User behavior data:¶
import pandas as pd
# Sample DataFrame creation for demonstration purposes
# df_cleaned = pd.read_csv('your_data_file.csv') # Replace with your actual data loading code
# Phase 3: Application Behavior Analysis
# Mapping of source columns to friendly aggregate names (kept for reference;
# the aggregation below names its outputs directly)
agg_columns = {
'Bearer Id': 'Number_of_sessions',
'Dur. (ms)': 'Total_duration',
'total_download_data': 'Total_download_data',
'total_upload_data': 'Total_upload_data',
'social_media_dl': 'Social_media_dl',
'social_media_ul': 'Social_media_ul',
'Youtube DL (Bytes)': 'Youtube_dl',
'Youtube UL (Bytes)': 'Youtube_ul',
'netflix_dl': 'Netflix_dl',
'Netflix_ul': 'Netflix_ul',
'google_dl': 'Google_dl',
'google_ul': 'Google_ul',
'email_dl': 'Email_dl',
'email_ul': 'Email_ul',
'gaming_dl': 'Gaming_dl',
'gaming_ul': 'Gaming_ul'
}
# Aggregate User Behavior Data
def aggregate_user_behavior(df):
    try:
        # Aggregate data by user (MSISDN/Number identifies the subscriber)
        user_behavior = df.groupby('MSISDN/Number').agg(
            number_of_sessions=('Bearer Id', 'count'),
            total_duration=('Dur. (ms)', 'sum'),
            total_download_data=('total_download_data', 'sum'),
            total_upload_data=('total_upload_data', 'sum'),
            social_media_dl=('social_media_dl', 'sum'),
            social_media_ul=('social_media_ul', 'sum'),
            youtube_dl=('Youtube DL (Bytes)', 'sum'),
            youtube_ul=('Youtube UL (Bytes)', 'sum'),
            netflix_dl=('netflix_dl', 'sum'),
            netflix_ul=('netflix_ul', 'sum'),
            google_dl=('google_dl', 'sum'),
            google_ul=('google_ul', 'sum'),
            email_dl=('email_dl', 'sum'),
            email_ul=('email_ul', 'sum'),
            gaming_dl=('gaming_dl', 'sum'),
            gaming_ul=('gaming_ul', 'sum')
        ).reset_index()
        return user_behavior
    except KeyError as e:
        print(f"KeyError: {e}")
        print("Check if all column names used in aggregation exist in the DataFrame.")
        return None
# Aggregate the data
user_behavior = aggregate_user_behavior(df_cleaned)
# Display the result
if user_behavior is not None:
    print(user_behavior.head())  # Display the first few rows of the aggregated data
   MSISDN/Number  number_of_sessions  total_duration  total_download_data  \
0   3.360301e+10                   2        120325.0          559774090.0
1   3.360301e+10                   1        135718.0          809319344.0
2   3.360301e+10                   1        109906.0           73126355.0
3   3.360301e+10                   1        135968.0          404238587.0
4   3.360301e+10                   2        317851.0          663280757.0

   total_upload_data  social_media_dl  social_media_ul  youtube_dl  \
0         82689008.0        3645050.0          34363.0  19763706.0
1         54693881.0        3129315.0          26481.0  12069683.0
2         49178027.0        2889635.0          31756.0  21993049.0
3         42532640.0        1937284.0          56480.0   7340018.0
4        121212483.0        5673144.0           9576.0  23605808.0

   youtube_ul  netflix_dl  netflix_ul   google_dl  google_ul   email_dl  \
0  23968713.0  21531149.0  28849501.0  13826874.0  3989839.0  4911001.0
1  21791120.0   5368625.0   9743429.0   6224071.0  1578117.0  1044945.0
2  12521844.0    500502.0   8105653.0   9186718.0   155369.0   389390.0
3    662245.0  21836623.0   7774556.0   1844790.0  3691346.0  1665258.0
4  30701746.0  36155210.0  35282326.0   8146098.0  7023976.0  4594679.0

   email_ul    gaming_dl   gaming_ul
0  813612.0  496096310.0  15801784.0
1  853721.0  781482705.0   6630497.0
2   97640.0   38167061.0  11719976.0
3  373764.0  369614614.0  13798986.0
4  372370.0  585105818.0  28752436.0
Visualizations for User behavior:¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reuse the user_behavior DataFrame aggregated in the previous cell
# Visualize User Behavior
if user_behavior is not None:
    # 1. Bar Chart of Total Download and Upload Data by Application (log scale)
    applications = [
        'social_media_dl', 'social_media_ul',
        'youtube_dl', 'youtube_ul',
        'netflix_dl', 'netflix_ul',
        'google_dl', 'google_ul',
        'email_dl', 'email_ul',
        'gaming_dl', 'gaming_ul'
    ]
    # Create a new DataFrame for plotting
    behavior_data = user_behavior[applications].sum().reset_index()
    behavior_data.columns = ['Application', 'Total Data (Bytes)']
    plt.figure(figsize=(12, 8))
    # Assign `hue` and disable the legend to avoid the seaborn palette deprecation warning
    sns.barplot(x='Total Data (Bytes)', y='Application', hue='Application',
                data=behavior_data, palette='viridis', legend=False)
    plt.xscale('log')  # Apply logarithmic scale to the x-axis
    plt.title('Total Data Usage by Application (Logarithmic Scale)')
    plt.xlabel('Total Data (Bytes)')
    plt.ylabel('Application')
    plt.show()

    # 2. Stacked Bar Chart of Total Download and Upload Data by Application
    download_data = user_behavior[['social_media_dl', 'youtube_dl', 'netflix_dl', 'google_dl', 'email_dl', 'gaming_dl']].sum()
    upload_data = user_behavior[['social_media_ul', 'youtube_ul', 'netflix_ul', 'google_ul', 'email_ul', 'gaming_ul']].sum()
    stacked_data = pd.DataFrame({
        'Application': [c[:-3] for c in download_data.index],  # strip the '_dl' suffix for neutral labels
        'Download': download_data.values,
        'Upload': upload_data.values
    })
    stacked_data.set_index('Application').plot(kind='bar', stacked=True, figsize=(12, 8), color=['#1f77b4', '#ff7f0e'])
    plt.title('Total Download and Upload Data by Application')
    plt.xlabel('Application')
    plt.ylabel('Total Data (Bytes)')
    plt.yscale('log')  # Logarithmic y-axis so the much smaller upload volumes stay visible
    plt.show()

    # 3. Pie Chart of Total Download Data by Application
    plt.figure(figsize=(10, 8))
    plt.pie(download_data, labels=download_data.index, autopct='%1.1f%%', startangle=140,
            colors=sns.color_palette('viridis', len(download_data)))
    plt.title('Total Download Data Distribution by Application')
    plt.show()

    # 4. Pie Chart of Total Upload Data by Application
    plt.figure(figsize=(10, 8))
    plt.pie(upload_data, labels=upload_data.index, autopct='%1.1f%%', startangle=140,
            colors=sns.color_palette('viridis', len(upload_data)))
    plt.title('Total Upload Data Distribution by Application')
    plt.show()
Phase 4: Exploratory Data Analysis (EDA)¶
In this phase we are going to conduct non-graphical univariate analysis, graphical univariate analysis, bivariate analysis, correlation analysis and dimensionality reduction.
1. Conduct non-graphical univariate analysis¶
import pandas as pd
# Sample DataFrame creation for demonstration purposes
# df_cleaned = pd.read_csv('your_data_file.csv') # Replace with your actual data loading code
# Define the columns for univariate analysis
quantitative_columns = [
'Dur. (ms)',
'total_download_data',
'total_upload_data',
'social_media_dl',
'social_media_ul',
'Youtube DL (Bytes)',
'Youtube UL (Bytes)',
'netflix_dl',
'Netflix_ul',  # NOTE: mis-cased; the cleaned frame uses 'netflix_ul', so this lookup fails below
'google_dl',
'google_ul',
'email_dl',
'email_ul',
'gaming_dl',
'gaming_ul'
]
# Non-Graphical Univariate Analysis
def univariate_analysis(df, columns):
    results = {}
    for column in columns:
        if column in df.columns:
            # Compute central-tendency and dispersion metrics
            results[column] = {
                'Mean': df[column].mean(),
                'Median': df[column].median(),
                'Standard Deviation': df[column].std(),
                'Range': df[column].max() - df[column].min()
            }
        else:
            print(f"Column '{column}' not found in the DataFrame.")
    return pd.DataFrame(results).T
# Perform the univariate analysis
analysis_results = univariate_analysis(df_cleaned, quantitative_columns)
# Display the results
print(analysis_results)
Column 'Netflix_ul' not found in the DataFrame.
Mean Median Standard Deviation \
Dur. (ms) 9.198034e+04 86399.0 4.929436e+04
total_download_data 4.546434e+08 455841077.5 2.441421e+08
total_upload_data 4.111975e+07 41143312.0 1.119712e+07
social_media_dl 1.795322e+06 1794369.0 1.035482e+06
social_media_ul 3.292843e+04 32920.0 1.900618e+04
Youtube DL (Bytes) 1.163407e+07 11616019.0 6.710569e+06
Youtube UL (Bytes) 1.100941e+07 11013447.0 6.345423e+06
netflix_dl 1.162685e+07 11642217.0 6.725218e+06
google_dl 5.750753e+06 5765829.0 3.309097e+06
google_ul 2.056542e+06 2054573.0 1.189917e+06
email_dl 1.791729e+06 1793505.0 1.035840e+06
email_ul 4.673734e+05 466250.0 2.699693e+05
gaming_dl 4.220447e+08 423408104.0 2.439675e+08
gaming_ul 8.288398e+06 8291208.0 4.782700e+06
Range
Dur. (ms) 237758.0
total_download_data 895855575.0
total_upload_data 63202464.0
social_media_dl 3586052.0
social_media_ul 65870.0
Youtube DL (Bytes) 23259045.0
Youtube UL (Bytes) 22011857.0
netflix_dl 23259147.0
google_dl 11462625.0
google_ul 4121354.0
email_dl 3586132.0
email_ul 936416.0
gaming_dl 843439373.0
gaming_ul 16558735.0
Interpretation of the above results¶
The univariate analysis of the dataset reveals key insights into the distribution of various quantitative variables related to application behavior. For instance, the average session duration (Dur. (ms)) is approximately 91,980 milliseconds (or about 92 seconds), with a substantial variability indicated by a standard deviation of 49,294 milliseconds. This large standard deviation suggests that session durations vary significantly among users, which may indicate diverse usage patterns or inconsistencies in data recording.
Data related to total download and upload activities show considerable figures, with the mean total download data reaching around 454,643,400 bytes and the mean total upload data at approximately 41,119,750 bytes. The standard deviations for these metrics are high, reflecting the wide range of data usage among users. Specifically, the data for social media downloads and uploads also exhibits significant variability, with high mean values and substantial standard deviations. This indicates that some users are far more engaged with social media platforms compared to others.
Note that the Netflix_ul column was not found in the DataFrame. This is a column-naming mismatch rather than missing data: the cleaned frame uses the lower-case name netflix_ul, while the analysis list requested Netflix_ul. The remaining variables (YouTube, Google, email, and gaming) show diverse usage patterns, with gaming downloads and uploads having particularly high mean values and ranges. This analysis highlights the importance of examining individual variables closely, as high variability can affect overall trends and interpretations.
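The variability discussed above is easier to compare across metrics of very different scales via the coefficient of variation (standard deviation divided by mean). A small sketch using the summary figures printed in the table above:

```python
# Coefficient of variation (std / mean) for selected metrics, computed from
# the (mean, std) pairs reported in the univariate analysis table above.
metrics = {
    "Dur. (ms)":           (9.198034e4, 4.929436e4),
    "total_download_data": (4.546434e8, 2.441421e8),
    "gaming_dl":           (4.220447e8, 2.439675e8),
    "social_media_ul":     (3.292843e4, 1.900618e4),
}
cv = {name: round(std / mean, 3) for name, (mean, std) in metrics.items()}
print(cv)  # all roughly 0.53-0.58: similar relative spread despite huge scale differences
```

Despite byte volumes spanning four orders of magnitude, the relative dispersion is similar across metrics, around 54-58% of the mean.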
2. Graphical univariate analysis¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming df_cleaned is already loaded and available
# df_cleaned = pd.read_csv('your_data_file.csv') # Replace with your actual data loading code
# Define the columns for visualization
visualization_columns = [
'Dur. (ms)',
'total_download_data',
'total_upload_data',
'social_media_dl',
'social_media_ul',
'Youtube DL (Bytes)',
'Youtube UL (Bytes)',
'netflix_dl',
'google_dl',
'google_ul',
'email_dl',
'email_ul',
'gaming_dl',
'gaming_ul'
]
# Function to create visualizations
def create_visualizations(df, columns):
    for column in columns:
        if column in df.columns:
            plt.figure(figsize=(18, 6))
            # Histogram
            plt.subplot(1, 3, 1)
            sns.histplot(df[column], bins=30, kde=True, color='blue')
            plt.title(f'Histogram of {column}')
            plt.xlabel(column)
            plt.ylabel('Frequency')
            # Box Plot
            plt.subplot(1, 3, 2)
            sns.boxplot(x=df[column], color='green')
            plt.title(f'Box Plot of {column}')
            plt.xlabel(column)
            # Density Plot (`fill=True` replaces the deprecated `shade=True`)
            plt.subplot(1, 3, 3)
            sns.kdeplot(df[column], fill=True, color='red')
            plt.title(f'Density Plot of {column}')
            plt.xlabel(column)
            plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
        else:
            print(f"Column '{column}' not found in the DataFrame.")
# Create the visualizations
create_visualizations(df_cleaned, visualization_columns)
C:\Users\Beab\AppData\Local\Temp\ipykernel_20476\4072508028.py:47: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. sns.kdeplot(df[column], shade=True, color='red')
3. Bivariate analysis¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# List of relevant columns for bivariate analysis
columns_of_interest = [
'social_media_dl', 'social_media_ul',
'google_dl', 'google_ul',
'email_dl', 'email_ul',
'Youtube DL (Bytes)', 'Youtube UL (Bytes)',
'netflix_dl', 'netflix_ul',
'gaming_dl', 'gaming_ul',
'total_download_data', 'total_upload_data'
]
# Ensure that these columns exist in the dataframe
relevant_df = df_cleaned[columns_of_interest].copy()
# Create a new column for total data usage (DL + UL)
relevant_df['total_data_usage'] = relevant_df['total_download_data'] + relevant_df['total_upload_data']
# Bivariate Analysis using enhanced scatter plots with correlation coefficient
def enhanced_bivariate_analysis(df, application_columns, total_data_column):
    num_plots = len(application_columns)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_plots + num_cols - 1) // num_cols  # Rows needed for the grid
    plt.figure(figsize=(18, num_rows * 5))
    for idx, col in enumerate(application_columns):
        if col in df.columns:
            # Correlation of this application's volume with total data usage
            correlation = np.corrcoef(df[col], df[total_data_column])[0, 1]
            plt.subplot(num_rows, num_cols, idx + 1)
            # Scatter plot coloured by total data usage, with a regression line overlaid
            scatter = plt.scatter(df[col], df[total_data_column], alpha=0.6,
                                  c=df[total_data_column], cmap='coolwarm', s=40)
            plt.colorbar(scatter, label='Total Data Usage (DL + UL)')
            sns.regplot(x=df[col], y=df[total_data_column], scatter=False, color='red', line_kws={'linewidth': 2})
            plt.title(f'{col} vs. Total Data Usage\nCorrelation: {correlation:.2f}')
            plt.xlabel(col)
            plt.ylabel('Total Data Usage (DL + UL)')
    plt.tight_layout()
    plt.show()
# Perform the enhanced bivariate analysis
application_columns = [
'social_media_dl', 'social_media_ul',
'google_dl', 'google_ul',
'email_dl', 'email_ul',
'Youtube DL (Bytes)', 'Youtube UL (Bytes)',
'netflix_dl', 'netflix_ul',
'gaming_dl', 'gaming_ul'
]
enhanced_bivariate_analysis(relevant_df, application_columns, 'total_data_usage')
Short interpretation for the above visuals¶
The visualization presents a series of scatter plots relating each application's download (dl) and upload (ul) volume to total data usage (DL + UL). Points are coloured by total data usage, a regression line is overlaid, and each panel reports the Pearson correlation with the total. Most applications show very low correlations with the total, indicating that they contribute little to overall traffic volume.
Notably, the panel for gaming_dl shows a correlation close to 1.00 with total data usage. This is expected rather than surprising: gaming downloads make up the bulk of total volume, so the total largely tracks gaming_dl. It contrasts sharply with applications like social media and email, whose small contributions leave them with near-zero correlations against the total.
Overall, the visualization underscores the dominant role of gaming in total data usage, while the other applications show little association with the total. This insight can be valuable for understanding data consumption trends and for allocating bandwidth across different online activities.
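The dominance of gaming can be sanity-checked from the per-session means reported in the univariate analysis earlier; a back-of-envelope sketch using those printed figures:

```python
# Back-of-envelope share of mean download traffic attributable to gaming,
# using the mean values printed in the univariate analysis table above.
mean_gaming_dl = 4.220447e8   # mean gaming_dl (bytes)
mean_total_dl = 4.546434e8    # mean total_download_data (bytes)
share = mean_gaming_dl / mean_total_dl
print(f"Gaming accounts for roughly {share:.1%} of mean download volume")
```

With gaming alone near 93% of mean download volume, the near-perfect correlation between gaming_dl and total usage follows almost mechanically.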
4. Correlation analysis¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Ensure df_cleaned is your preprocessed DataFrame
# Relevant columns for correlation analysis
correlation_columns = [
'social_media_dl',
'social_media_ul',
'google_dl',
'google_ul',
'email_dl',
'email_ul',
'Youtube DL (Bytes)',
'Youtube UL (Bytes)',
'netflix_dl',
'netflix_ul',
'gaming_dl',
'gaming_ul',
'Other DL (Bytes)',
'Other UL (Bytes)'
]
# Extract relevant columns from df_cleaned
df_correlation = df_cleaned[correlation_columns]
# Compute the correlation matrix
correlation_matrix = df_correlation.corr()
# Display the correlation matrix
print(correlation_matrix)
# Visualize the correlation matrix with a more engaging color palette
plt.figure(figsize=(12, 8))
cmap = sns.diverging_palette(220, 20, as_cmap=True) # Diverging palette for better color contrast
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, fmt=".2f",
linewidths=0.5, cbar_kws={"shrink": 0.75}, center=0)
plt.title('Correlation Matrix of Data Usage by Application', fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.show()
# Interpretation:
# - Positive correlations close to 1 indicate a strong direct relationship between two variables,
# implying that as one increases, so does the other.
# - Negative correlations close to -1 suggest an inverse relationship, meaning that as one variable increases,
# the other decreases.
# - Correlations close to 0 imply little to no linear relationship between the variables.
social_media_dl social_media_ul google_dl google_ul \
social_media_dl 1.000000 -0.000555 -0.001954 0.000186
social_media_ul -0.000555 1.000000 0.003542 0.001085
google_dl -0.001954 0.003542 1.000000 -0.000103
google_ul 0.000186 0.001085 -0.000103 1.000000
email_dl 0.004745 0.004823 -0.000039 -0.001157
email_ul -0.002518 -0.001427 0.000742 -0.002050
Youtube DL (Bytes) -0.001389 -0.001786 0.001760 0.001122
Youtube UL (Bytes) 0.002345 -0.002368 0.000638 0.001670
netflix_dl -0.001817 -0.000772 0.000548 -0.001288
netflix_ul 0.003457 -0.002493 -0.001569 -0.004073
gaming_dl 0.001018 0.000210 -0.001421 0.001140
gaming_ul 0.003095 0.000844 -0.001429 -0.002516
Other DL (Bytes) 0.006126 -0.003850 -0.002557 0.002097
Other UL (Bytes) 0.000610 0.000971 -0.000882 -0.002219
email_dl email_ul Youtube DL (Bytes) \
social_media_dl 0.004745 -0.002518 -0.001389
social_media_ul 0.004823 -0.001427 -0.001786
google_dl -0.000039 0.000742 0.001760
google_ul -0.001157 -0.002050 0.001122
email_dl 1.000000 0.002553 0.001679
email_ul 0.002553 1.000000 -0.000320
Youtube DL (Bytes) 0.001679 -0.000320 1.000000
Youtube UL (Bytes) -0.004305 -0.006834 0.002433
netflix_dl 0.001351 -0.000011 0.001202
netflix_ul -0.000016 0.000532 0.001746
gaming_dl 0.000050 -0.002519 -0.001861
gaming_ul -0.004335 0.005000 -0.001893
Other DL (Bytes) -0.000444 -0.003805 0.002042
Other UL (Bytes) 0.004675 0.000081 -0.004830
Youtube UL (Bytes) netflix_dl netflix_ul gaming_dl \
social_media_dl 0.002345 -0.001817 0.003457 0.001018
social_media_ul -0.002368 -0.000772 -0.002493 0.000210
google_dl 0.000638 0.000548 -0.001569 -0.001421
google_ul 0.001670 -0.001288 -0.004073 0.001140
email_dl -0.004305 0.001351 -0.000016 0.000050
email_ul -0.006834 -0.000011 0.000532 -0.002519
Youtube DL (Bytes) 0.002433 0.001202 0.001746 -0.001861
Youtube UL (Bytes) 1.000000 0.000423 0.002477 -0.002382
netflix_dl 0.000423 1.000000 0.001065 -0.003201
netflix_ul 0.002477 0.001065 1.000000 -0.001051
gaming_dl -0.002382 -0.003201 -0.001051 1.000000
gaming_ul 0.000898 -0.000633 -0.004015 0.001162
Other DL (Bytes) 0.002296 -0.006000 -0.006691 -0.002591
Other UL (Bytes) -0.001393 -0.002588 -0.003447 -0.001131
gaming_ul Other DL (Bytes) Other UL (Bytes)
social_media_dl 0.003095 0.006126 0.000610
social_media_ul 0.000844 -0.003850 0.000971
google_dl -0.001429 -0.002557 -0.000882
google_ul -0.002516 0.002097 -0.002219
email_dl -0.004335 -0.000444 0.004675
email_ul 0.005000 -0.003805 0.000081
Youtube DL (Bytes) -0.001893 0.002042 -0.004830
Youtube UL (Bytes) 0.000898 0.002296 -0.001393
netflix_dl -0.000633 -0.006000 -0.002588
netflix_ul -0.004015 -0.006691 -0.003447
gaming_dl 0.001162 -0.002591 -0.001131
gaming_ul 1.000000 -0.002381 -0.006011
Other DL (Bytes) -0.002381 1.000000 0.003421
Other UL (Bytes) -0.006011 0.003421 1.000000
Interpretation of the above correlation matrix¶
The correlation matrix provides an overview of how data usage across applications relates. Nearly all off-diagonal correlations are close to zero, indicating that usage of the different applications (social media, email, Google, and so on) is largely independent: a user's behavior on one platform does not predict their activity on another.
The only values of 1.00 sit on the diagonal, where each variable is correlated with itself, and those carry no information about relationships between metrics. Notably, even download and upload volumes for the same application are essentially uncorrelated (e.g. Youtube DL vs. Youtube UL at roughly 0.002), so a spike in downloads for a service does not imply a matching spike in uploads.
In summary, the lack of correlation among the application-level metrics points to distinct, independent consumption patterns. This can inform bandwidth management and application development strategies that treat each platform's usage separately.
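Rather than scanning a large matrix by eye, the strongest off-diagonal pairs can be extracted programmatically. A hedged sketch of such a helper, demonstrated on a small synthetic frame (in this notebook the real input would be `df_correlation.corr()`):

```python
import numpy as np
import pandas as pd

# Build a small synthetic frame where 'a' and 'b' are strongly related
# and 'c' is independent, to demonstrate the helper.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({"a": a,
                     "b": a + rng.normal(scale=0.1, size=500),
                     "c": rng.normal(size=500)})

def top_correlations(corr, n=3):
    """Return the n largest off-diagonal correlations (by absolute value)."""
    pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack()  # drop the diagonal
    # Keep one copy of each symmetric pair
    pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
    return pairs.abs().sort_values(ascending=False).head(n)

print(top_correlations(demo.corr()))  # (a, b) ranks first, far above the rest
```

Applied to the matrix above, this would surface at once that no application pair exceeds |r| of about 0.007.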
5. Dimensionality reduction (PCA)¶
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming df_cleaned is already loaded and preprocessed
# Select relevant columns for PCA
columns = ['Dur. (ms)', 'Avg RTT DL (ms)', 'Avg RTT UL (ms)',
'Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)',
'TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)',
'DL TP < 50 Kbps (%)', '50 Kbps < DL TP < 250 Kbps (%)',
'250 Kbps < DL TP < 1 Mbps (%)', 'DL TP > 1 Mbps (%)',
'UL TP < 10 Kbps (%)', '10 Kbps < UL TP < 50 Kbps (%)',
'50 Kbps < UL TP < 300 Kbps (%)', 'UL TP > 300 Kbps (%)',
'HTTP DL (Bytes)', 'HTTP UL (Bytes)',
'Activity Duration DL (ms)', 'Activity Duration UL (ms)',
'total_upload_data', 'total_download_data']
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_cleaned[columns])
# Perform PCA
pca = PCA(n_components=2) # Reducing to 2 components for visualization
pca_result = pca.fit_transform(scaled_data)
# Create a DataFrame with PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PCA1', 'PCA2'])
# Add PCA results to the original DataFrame (for meaningful interpretation)
df_cleaned_pca = pd.concat([df_cleaned.reset_index(drop=True), pca_df], axis=1)
# Plotting the heatmap
plt.figure(figsize=(14, 12))
corr_matrix = df_cleaned[columns].corr()
# Create a heatmap with an easy-to-understand color palette
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='viridis', linewidths=0.5, cbar_kws={'shrink': 0.8})
# Add title and labels
plt.title('Correlation Heatmap of Key Variables', fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
# Display the heatmap
plt.show()
# Print explanation for heatmap
print("The heatmap above displays the correlation between key variables. Correlation values range from -1 to 1, where:")
print("- Values close to 1 indicate a strong positive correlation (as one variable increases, the other also increases).")
print("- Values close to -1 indicate a strong negative correlation (as one variable increases, the other decreases).")
print("- Values around 0 indicate little to no correlation between the variables.")
# Plotting PCA results
plt.figure(figsize=(10, 8))
plt.scatter(pca_df['PCA1'], pca_df['PCA2'], alpha=0.5, edgecolors='w', s=50)
plt.title('PCA Scatter Plot of Data', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)
plt.grid(True)
# Add text annotation for variance explained
explained_variance = pca.explained_variance_ratio_
plt.figtext(0.15, 0.15, f'PCA1 explains {explained_variance[0]*100:.2f}% of the variance\nPCA2 explains {explained_variance[1]*100:.2f}% of the variance',
fontsize=12, bbox=dict(facecolor='white', alpha=0.5))
plt.show()
# Print explanation for PCA scatter plot
print("\nThe scatter plot of PCA results shows how the data points are distributed across the two principal components:")
print("- Principal Component 1 (PCA1) and Principal Component 2 (PCA2) are the new dimensions that capture the most variance in the data.")
print("- The plot helps to visualize the structure of the data and identify patterns or clusters.")
print("- The percentages on the plot indicate how much of the total variance in the data is explained by each principal component.")
The heatmap above displays the correlation between key variables. Correlation values range from -1 to 1, where:
- Values close to 1 indicate a strong positive correlation (as one variable increases, the other also increases).
- Values close to -1 indicate a strong negative correlation (as one variable increases, the other decreases).
- Values around 0 indicate little to no correlation between the variables.

The scatter plot of PCA results shows how the data points are distributed across the two principal components:
- Principal Component 1 (PCA1) and Principal Component 2 (PCA2) are the new dimensions that capture the most variance in the data.
- The plot helps to visualize the structure of the data and identify patterns or clusters.
- The percentages on the plot indicate how much of the total variance in the data is explained by each principal component.
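A natural follow-up to fixing n_components=2 is choosing the number of components from the cumulative explained variance. A sketch of that selection rule, demonstrated on synthetic correlated data (here `df_cleaned[columns]` would be the real input):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 underlying factors spread across 10 observed features,
# standing in for the real df_cleaned[columns] matrix.
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 3))          # 3 true latent factors
mixing = rng.normal(size=(3, 10))            # how factors map to observed features
X = latent @ mixing + rng.normal(scale=0.05, size=(1000, 10))

scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(scaled)                      # keep all components initially
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components reaching 95% cumulative explained variance
n_components = int(np.searchsorted(cumvar, 0.95) + 1)
print(f"{n_components} components explain >= 95% of the variance")
```

On the synthetic data the rule recovers the 3 underlying factors; on the real telecom features it would indicate whether two components are actually enough to summarize the variance.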